Large - Scale Non - Linear Regression within the Mapreduce Framework
نویسندگان
چکیده
Large-scale Non-linear Regression within the MapReduce Framework By: Ahmed Khademzadeh Thesis Advisor: Philip Chan, Ph.D. Regression models have many applications in real world problems such as finance, epidemiology, environmental science, etc.. Big datasets are everywhere these days, and bigger datasets would help us to construct better models from the data. The issue with big datasets is that they would need a long time to be processed or even to be read on a single machine. This research employs MapReduce to model large-scale non-linear regression problems in a parallel fashion. MRRT (MapReduce Regression Tree) algorithm divides the feature space into overlapping subspaces and then shuffles each of the subspace’s data items to a node in the cluster. Each node in the cluster then constructs a regression tree for the subspace of the data it has received. Different versions of algorithm (overlapping/non-overlapping subspaces and weighted/unweighted prediction using neighboring models) are proposed and compared with the regression tree (RT) algorithm implemented in Matlab libraries. Experiments on synthetic and real datasets show that MRRT algorithm that is devised to be fast and scalable for MapReduce framework not only has a close to linear speedup, and close to optimum scalability, but also outperforms the RT algorithm in terms of accuracy (in most cases) and improves the prediction time by more than 80%. Although MRRT is designed for MapReduce framework, it could be used on a single machine, and in that case it improves the learning time by 60% (in most cases) comparing to RT algorithm, and shows to be of close to linear scalability (comparing to RT algorithm which is roughly of quadratic scalability).
منابع مشابه
Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments
Hadoop MapReduce framework is an important distributed processing model for large-scale data intensive applications. The current Hadoop and the existing Hadoop distributed file system’s rack-aware data placement strategy in MapReduce in the homogeneous Hadoop cluster assume that each node in a cluster has the same computing capacity and a same workload is assigned to each node. Default Hadoop d...
متن کاملRobust Regression on MapReduce
Although the MapReduce framework is now the de facto standard for analyzing massive data sets, many algorithms (in particular, many iterative algorithms popular in machine learning, optimization, and linear algebra) are hard to fit into MapReduce. Consider, e.g., the `p regression problem: given a matrix A ∈ Rm×n and a vector b ∈ R, find a vector x∗ ∈ R that minimizes f(x) = ‖Ax− b‖p. The widel...
متن کاملUtilizing Kernel Adaptive Filters for Speech Enhancement within the ALE Framework
Performance of the linear models, widely used within the framework of adaptive line enhancement (ALE), deteriorates dramatically in the presence of non-Gaussian noises. On the other hand, adaptive implementation of nonlinear models, e.g. the Volterra filters, suffers from the severe problems of large number of parameters and slow convergence. Nonetheless, kernel methods are emerging solutions t...
متن کاملParallel Implementation of Multiple Linear Regression Algorithm Based on MapReduce
The amount of data generated by traditional business activities, creating data repositories ranging from terabytes to petabytes in size. However, this information cannot be practically analyzed on a single commodity computer because the data is too large to fit in memory. For this purpose, the large size of data to be processed requires the use of high-performance analytical systems running on ...
متن کاملParallel extreme learning machine for regression based on MapReduce
Regression is one of the most basic problems in data mining. For regression problem, extreme learning machine (ELM) can get better generalization performance at a much faster learning speed. However, the enlarging volume of datasets makes regression by ELM on very large scale datasets a challenging task. Through analyzing the mechanism of ELM algorithm, an efficient parallel ELM for regression ...
متن کامل